Elon Musk's Grok 'Undressing' Problem Isn't Fixed

WIRED

X has placed more restrictions on Grok's ability to generate explicit AI images, but tests show that the updates have created a patchwork of limitations that fail to fully address the issue. Elon Musk's X has introduced new restrictions stopping people from editing and generating images of real people in bikinis or other "revealing clothing." The change in policy on Wednesday night follows global outrage at Grok being used to generate thousands of harmful non-consensual "undressing" photos of women and sexualized images of apparent minors on X. However, while it appears that some safety measures have finally been introduced to Grok's image generation on X, the standalone Grok app and website seem to still be able to generate "undress" style images and pornographic content, according to multiple tests by researchers, WIRED, and other journalists. Other users, meanwhile, say they're no longer able to create images and videos as they once could.


A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)

Neural Information Processing Systems

Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable generalization capabilities across multiple challenging distribution shifts. However, there is still much to be explored in terms of their robustness to the variations of specific visual factors. In real-world applications, reliable and safe systems must consider other safety measures beyond classification accuracy, such as predictive uncertainty. Yet, the effectiveness of CLIP models on such safety-related objectives is less-explored. Driven by the above, this work comprehensively investigates the safety measures of CLIP models, specifically focusing on three key properties: resilience to visual factor variations, calibrated uncertainty estimations, and the ability to detect anomalous inputs.
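The "calibrated uncertainty estimations" property above refers to how closely a model's predicted confidence tracks its empirical accuracy. A common summary metric for this (standard in the calibration literature, not taken from this paper) is the binned expected calibration error (ECE), which a minimal sketch can compute from per-example confidences and correctness flags:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between mean confidence and
    accuracy within each confidence bin. `confidences` are in [0, 1],
    `correct` holds 0/1 correctness flags for the same examples."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi], with 0.0 folded into the first bin
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == lo)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(avg_conf - acc)
    return ece
```

A perfectly calibrated model (e.g. 90% confidence, 90% accuracy in every bin) scores 0; an overconfident one scores the average confidence-accuracy gap.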


Engineers propose massive airbags for airplanes

Popular Science

The system uses an AI model that would trigger a Kevlar bubble cocoon in the event of a crash. 'REBIRTH is more than engineering--it's a response to grief,' the researchers wrote. Breakthroughs, discoveries, and DIY tips sent every weekday. An Air India flight from Ahmedabad bound for London spent just 30 seconds in the air before disaster struck earlier this year. Preliminary reports indicate that the aircraft's fuel control switches were inexplicably turned off shortly after takeoff, cutting fuel to the engines and causing total power loss.


Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Korbak, Tomek, Balesni, Mikita, Barnes, Elizabeth, Bengio, Yoshua, Benton, Joe, Bloom, Joseph, Chen, Mark, Cooney, Alan, Dafoe, Allan, Dragan, Anca, Emmons, Scott, Evans, Owain, Farhi, David, Greenblatt, Ryan, Hendrycks, Dan, Hobbhahn, Marius, Hubinger, Evan, Irving, Geoffrey, Jenner, Erik, Kokotajlo, Daniel, Krakovna, Victoria, Legg, Shane, Lindner, David, Luan, David, Mądry, Aleksander, Michael, Julian, Nanda, Neel, Orr, Dave, Pachocki, Jakub, Perez, Ethan, Phuong, Mary, Roger, Fabien, Saxe, Joshua, Shlegeris, Buck, Soto, Martín, Steinberger, Eric, Wang, Jasmine, Zaremba, Wojciech, Baker, Bowen, Shah, Rohin, Mikulik, Vlad

arXiv.org Machine Learning

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
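As a toy illustration of the monitoring idea (not from the paper, and far simpler than any monitor the authors envision), a CoT monitor can be framed as a function that scans a model's reasoning trace for signals of intent to misbehave; real systems would use a trained classifier or a second model rather than this hypothetical phrase list:

```python
# Hypothetical phrase list for illustration only; a production monitor
# would score traces with a learned model, not keyword matching.
SUSPICIOUS_PHRASES = [
    "without the user noticing",
    "hide this from",
    "disable the logging",
    "pretend to comply",
]

def monitor_cot(trace: str) -> list:
    """Return the suspicious phrases found in a reasoning trace."""
    lowered = trace.lower()
    return [p for p in SUSPICIOUS_PHRASES if p in lowered]
```

The paper's fragility point maps directly onto this sketch: if training pressure teaches the model to keep such phrases out of its visible trace, the monitor's signal disappears even if the underlying intent does not.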


Exclusive: New Claude Model Triggers Stricter Safeguards at Anthropic

TIME - Tech

This moment is a crucial test for Anthropic, a company that claims it can mitigate AI's dangers while still competing in the market. Claude is a direct competitor to ChatGPT, and brings in over $2 billion in annualized revenue. Anthropic argues that its RSP thus creates an economic incentive for itself to build safety measures in time, lest it lose customers as a result of being prevented from releasing new models. "We really don't want to impact customers," Kaplan told TIME earlier in May while Anthropic was finalizing its safety measures. "We're trying to be proactively prepared." But Anthropic's RSP--and similar commitments adopted by other AI companies--are all voluntary policies that could be changed or cast aside at will.


Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods

Grey, Markov, Segerie, Charbel-Raphaël

arXiv.org Artificial Intelligence

As frontier AI systems advance toward transformative capabilities, we need a parallel transformation in how we measure and evaluate these systems to ensure safety and inform governance. While benchmarks have been the primary method for estimating model capabilities, they often fail to establish true upper bounds or predict deployment behavior. This literature review consolidates the rapidly evolving field of AI safety evaluations, proposing a systematic taxonomy around three dimensions: what properties we measure, how we measure them, and how these measurements integrate into frameworks. We show how evaluations go beyond benchmarks by measuring what models can do when pushed to the limit (capabilities), the behavioral tendencies exhibited by default (propensities), and whether our safety measures remain effective even when faced with subversive adversarial AI (control). These properties are measured through behavioral techniques like scaffolding, red teaming and supervised fine-tuning, alongside internal techniques such as representation analysis and mechanistic interpretability. We provide deeper explanations of some safety-critical capabilities like cybersecurity exploitation, deception, autonomous replication, and situational awareness, alongside concerning propensities like power-seeking and scheming. The review explores how these evaluation methods integrate into governance frameworks to translate results into concrete development decisions. We also highlight challenges to safety evaluations - proving absence of capabilities, potential model sandbagging, and incentives for "safetywashing" - while identifying promising research directions. By synthesizing scattered resources, this literature review aims to provide a central reference point for understanding AI safety evaluations.


AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations

Bowen, Dillon, Dombrowski, Ann-Kathrin, Gleave, Adam, Cundy, Chris

arXiv.org Artificial Intelligence

The rapid advancement of AI systems has raised widespread concerns about potential harms of frontier AI systems and the need for responsible evaluation and oversight. In this position paper, we argue that frontier AI companies should report both pre- and post-mitigation safety evaluations to enable informed policy decisions. Evaluating models at both stages provides policymakers with essential evidence to regulate deployment, access, and safety standards. We show that relying on either in isolation can create a misleading picture of model safety. Our analysis of AI safety disclosures from leading frontier labs identifies three critical gaps: (1) companies rarely evaluate both pre- and post-mitigation versions, (2) evaluation methods lack standardization, and (3) reported results are often too vague to inform policy. To address these issues, we recommend mandatory disclosure of pre- and post-mitigation capabilities to approved government bodies, standardized evaluation methods, and minimum transparency requirements for public safety reporting. These ensure that policymakers and regulators can craft targeted safety measures, assess deployment risks, and scrutinize companies' safety claims effectively.


DiffGuard: Text-Based Safety Checker for Diffusion Models

Khader, Massine El, Bouzidi, Elias Al, Oumida, Abdellah, Sbaihi, Mohammed, Binard, Eliott, Poli, Jean-Philippe, Ouerdane, Wassila, Addad, Boussad, Kapusta, Katarzyna

arXiv.org Artificial Intelligence

Recent advances in Diffusion Models have enabled the generation of images from text, with powerful closed-source models like DALL-E and Midjourney leading the way. However, open-source alternatives, such as StabilityAI's Stable Diffusion, offer comparable capabilities. These open-source models, hosted on Hugging Face, come equipped with ethical filter protections designed to prevent the generation of explicit images. This paper reveals first their limitations and then presents a novel text-based safety filter that outperforms existing solutions. Our research is driven by the critical need to address the misuse of AI-generated content, especially in the context of information warfare. DiffGuard enhances filtering efficacy, achieving a performance that surpasses the best existing filters by over 14%.
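DiffGuard's actual classifier is not reproduced here; purely as a hypothetical sketch of where a text-based safety checker sits in the pipeline, it intercepts the user prompt before it ever reaches the diffusion model and rejects prompts scored as unsafe (a real filter would use a learned classifier, not the blocklist stand-in below):

```python
def blocked_terms_score(prompt: str, blocked: set) -> float:
    """Fraction of prompt tokens that appear on a blocklist.
    A crude stand-in for a learned classifier's unsafe-probability."""
    tokens = prompt.lower().split()
    if not tokens:
        return 0.0
    return sum(t in blocked for t in tokens) / len(tokens)

def check_prompt(prompt: str, blocked: set, threshold: float = 0.2) -> bool:
    """Return True if the prompt is allowed through to the generator."""
    return blocked_terms_score(prompt, blocked) < threshold
```

The design point generalizes: because the check runs on text before any image is generated, it is cheap to apply and easy to update, which is what makes text-based filters attractive for open-source models whose weights cannot be recalled.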


LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

Zhou, Yujun, Yang, Jingdong, Guo, Kehan, Chen, Pin-Yu, Gao, Tian, Geyer, Werner, Moniz, Nuno, Chawla, Nitesh V, Zhang, Xiangliang

arXiv.org Artificial Intelligence

Laboratory accidents pose significant risks to human life and property, underscoring the importance of robust safety protocols. Despite advancements in safety training, laboratory personnel may still unknowingly engage in unsafe practices. With the increasing reliance on large language models (LLMs) for guidance in various fields, including laboratory settings, there is a growing concern about their reliability in critical safety-related decision-making. Unlike trained human researchers, LLMs lack formal lab safety education, raising questions about their ability to provide safe and accurate guidance. Existing research on LLM trustworthiness primarily focuses on issues such as ethical compliance, truthfulness, and fairness but fails to fully cover safety-critical real-world applications, like lab safety. To address this gap, we propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive evaluation framework based on a new taxonomy aligned with Occupational Safety and Health Administration (OSHA) protocols. This benchmark includes 765 multiple-choice questions verified by human experts, assessing LLMs and vision language models (VLMs) performance in lab safety contexts. Our evaluations demonstrate that while GPT-4o outperforms human participants, it is still prone to critical errors, highlighting the risks of relying on LLMs in safety-critical environments. Our findings emphasize the need for specialized benchmarks to accurately assess the trustworthiness of LLMs in real-world safety applications.

